Abstract

We present an efficient and accurate method for transferring annotations between two different treebanks of the same language. This method led to the creation of a new instance of the French Treebank (Abeillé et al., 2003), which follows the Universal Dependencies annotation scheme and which was made available to the participants of the CoNLL 2017 Universal Dependencies parsing shared task (Zeman et al., 2017). Strong results from an evaluation on our gold standard (94.75% LAS and 99.40% UAS on the test set) demonstrate the quality of this new annotated data set and validate our approach.
1.  Introduction

After many decades of treebanking initiatives (Einarsson, 1976; Marcus et al., 1993), the interest in developing annotated corpora no longer needs to be justified. Although a distinction can be drawn between treebanks created for linguistic purposes and those conceived only from a natural language processing perspective, it tends to fade in the face of machine learning's ever-growing appetite for new sources of labeled data. In fact, not only can any annotated corpus be used as a primary or secondary source of training data within more or less complex systems, but hand-crafted syntactic resources such as grammars and lexicons can also be used as sources of features to guide data-driven systems (Øvrelid et al., 2009; Villemonte De La Clergerie, 2014a). The crucial point here lies in the interoperability of such heterogeneous sources of information. Before the rise of the Universal Dependencies initiative (Nivre et al., 2017) and its eponymous scheme, henceforth UD, which resulted in the release of 81 treebanks covering more than 50 languages, the situation was at best complicated. Nevertheless, the pre-UD multitude of annotation schemes allowed many to use stacking methodologies for predicting syntactic annotations of a certain type and following specific guidelines (e.g. UD dependencies) with the help of other types of annotations that follow different schemes, sometimes even of a different topological nature (Farkas and Bohnet, 2012; Björkelund et al., 2013; Ambati et al., 2013; Ribeyre et al., 2015). In most cases, taking such heterogeneous syntactic information into account in the form of additional features does improve parsing accuracy.

Unsurprisingly, the performance gain is generally outstanding whenever such features are extracted from gold annotations. When the goal is to produce new reference annotated data, such a performance gain results in fewer post-annotation corrections. When converting a treebank to another annotation scheme, such gold information is of course readily available and has the potential to ease the process considerably.

In this paper, we describe such a conversion effort, for which we had to meet another drastic constraint: in the context of the preparation of the CoNLL 2017 shared task on “Multilingual Parsing from Raw Text to Universal Dependencies” (Zeman et al., 2017), we had less than two weeks to convert the French Treebank (Abeillé et al., 2003, hereafter FTB) from its SPMRL¹ dependency version (Seddah et al., 2013) into a new one that complies with the UD guidelines.

Such an objective forced us to consider every technique that could help produce a treebank following the UD scheme with the best possible accuracy. Since we were to produce a new data set, the use of a data-driven process fed with gold features whenever possible was the only way out. The result of our conversion process, as measured on a silver standard in terms of labeled attachment score (LAS), reaches around 98.50% on the Sequoia UD Treebank (Candito and Seddah, 2012; Nivre et al., 2017). Against a smaller, manually validated subset, we reach 94.75% LAS and 99.42% unlabeled attachment score (UAS). These scores are likely to reflect the high quality of our resulting data set.

In the remainder of this paper, we describe the methodology we used to build the UD version of the FTB, hereafter Ftb-UD, and present our evaluation process and results. The Ftb-UD is available under the same licence conditions as the original Ftb.²

2.  Method  Overview 

The basic idea is the following: we had access to a rule-based system for automatically converting another treebank, namely the French Sequoia Treebank (Candito and Seddah, 2012, hereafter SEQUOIA), into UD. After adapting the Ftb's native tokenization scheme to UD, this conversion system was directly applied to the Ftb. This resulted in many errors: 16% of the sentences contained one or more errors at one or more levels (POS, dependency, head), and between 6 and 7% of tokens were flagged as conversion failures, leading of course to many more incorrect tree structures. The Ftb being six times larger than SEQUOIA, adapting and extending the initial set of rules was not feasible in such a short time. We automatically corrected incorrect coordination tree structures and manually corrected missing POS tags resulting from conversion failures. We then decided to reparse all error-flagged dependencies using our robust shift-reduce parser with dynamic oracle (Villemonte De La Clergerie, 2013).

¹ Statistical Parsing of Morphologically-Rich Languages.

² https://github.com/UniversalDependencies/UD_French-FTB

Figure 1: Overview of our cross-treebank parser training process (diagram not reproduced; legible labels: gold feature injection, Failed / NoFailed split, merging, random noise injection, dynamic oracle, cross-treebank parser training, parsing model).

The idea was to build a pseudo-gold training set (made of 90% of the SEQUOIA treebank and of the Ftb training sentences that contained no conversion errors, leaving aside 20% of those for pseudo-gold evaluation) into which we injected both (i) external gold morpho-syntactic features coming from the Ftb SPMRL version and (ii) random noise, such as empty dependencies, in the same proportions as the initial conversion errors (see Figure 1 for an overview of the training process). We then parsed all erroneous sentences (after deleting all incorrect edges) with this model, on the hypothesis that the parser would be able to predict correct dependencies provided it was given the proper external gold features.

3.  Building  the  Ftb-UD 

Besides providing another source of annotated French data to the CoNLL 2017 shared task participants, our primary goal was to enable cross-parsing comparisons between different annotation schemes, namely the native Ftb dependency scheme (Candito et al., 2010) as instantiated in the SPMRL shared tasks (Seddah et al., 2013; Seddah et al., 2014) and the then upcoming UD 2.0 scheme (Nivre et al., 2017) that was to be used for this shared task (Zeman et al., 2017). For these reasons, our starting point is the Ftb SPMRL instance and not its latest incarnation.³

3.1.  Multi-word  Expression  Treatment 

We started by adapting the annotation scheme for multiword expressions (MWEs). The treebank with the fewest types of annotated MWEs is the SEQUOIA treebank, which contains fixed functional MWEs. We thus used the existing rule-based software of Candito and Crabbé (2009) to “undo” non-functional MWEs, namely to recover a regular syntactic structure for regular nominal, adjectival, verbal and adverbial MWEs. The patterns for spotting and undoing MWEs are a subset of those of Candito and Crabbé (2009). All the remaining MWEs were then represented using the fixed dependency label, which is used for functional MWEs. This choice is open to discussion in light of the current debate within the UD community regarding the status to give to named entities. For example, the Ftb contains many nominal named entities (tagged N N, e.g. for persons); assuming a proper disambiguation step, those could have received a flat:name label instead.

³ http://ftb.linguist.univ-paris-diderot.fr, released in December 2016.

We then adapted the word segmentation to that of French UD 2.0, the main difference concerning amalgamated prepositions: e.g. the amalgamated preposition+determiner au (lit. “to the”) is systematically treated as one token but two words (à (to) and le (the)).
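The "one token, two words" treatment can be illustrated with a minimal sketch that emits CoNLL-U style identifiers, where a range ID (e.g. 3-4) carries the surface token and the following rows carry its syntactic words. The `AMALGAMS` table and the function name are hypothetical and not part of the authors' pipeline. Only au and aux are listed, since forms such as du and des are not always amalgams and would need POS information to disambiguate.

```python
# Hypothetical lookup table: surface amalgam -> syntactic words.
# Assumption: only unambiguous contractions are listed here.
AMALGAMS = {
    "au": ("à", "le"),
    "aux": ("à", "les"),
}

def split_amalgams(tokens):
    """Return (id, form) rows in the style of CoNLL-U.

    An amalgam yields one range row (the surface token) followed by
    one row per syntactic word; other tokens yield a single row.
    """
    rows = []
    next_id = 1
    for tok in tokens:
        parts = AMALGAMS.get(tok.lower())
        if parts is None:
            rows.append((str(next_id), tok))
            next_id += 1
        else:
            last_id = next_id + len(parts) - 1
            rows.append((f"{next_id}-{last_id}", tok))
            for part in parts:
                rows.append((str(next_id), part))
                next_id += 1
    return rows

if __name__ == "__main__":
    for row in split_amalgams(["Il", "va", "au", "marché"]):
        print("\t".join(row))
```

In this sketch the word-level rows, not the range row, are the ones that would receive POS tags and dependencies.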

3.2.  Application of the Sequoia-to-UD Rule-based Converter

Before we started this work, another research team was working on the conversion of the SEQUOIA treebank to the UD annotation scheme (Guillaume et al., to appear) using their graph rewriting engine (Guillaume et al., 2012). Because the SEQUOIA treebank's native annotation scheme uses the same guidelines as the Ftb, we chose to use the rule set they developed in order to bootstrap the conversion process. However, the two corpora differ considerably in size (3k vs. 18k sentences, respectively) and in domain (Wikipedia, Europarl and biomedical texts for SEQUOIA; international and national newswire for the Ftb), which makes applying the SEQUOIA-to-UD conversion process to a new domain non-trivial. As mentioned in the previous section, the resulting treebank contained 16% of sentences with one or more errors, and 6% after correction of some coordinate structures. The next two sections describe how we corrected those errors.

3.3.  POS Correction and Injection of Gold Features

POS correction The application of the conversion rules resulted in a failure to produce a POS tag for 89 wordforms (61 in the training set, 3 in the development set, 25 in the test set). We manually reviewed and POS-annotated all these cases.

⁴ Please note that the version distributed for the UD Shared Task did not contain this regularization, which will be included in the next major release.

Injection of Morpho-syntactic Gold Features We first developed an algorithm for automatically post-aligning the output of the conversion with the original Ftb SPMRL files, which differ in how they are segmented into tokens and wordforms. This algorithm reads both versions of the same sentence and stores the wordforms from each file as well as the multi-wordform tokens from the converted version. It then aligns tokens using a robust synchronization algorithm that traverses both token sequences for a given sentence from left to right. Whenever tokens do not match, the algorithm performs a lookahead on both token sequences until it finds a new anchor point, the “forward anchor”. The search for a forward anchor is itself robust to tokenization mismatches: it relies on the notion of “weak match”, which is used only for comparing right contexts. A weak match is defined as a disjunction of patterns; the main pattern looks for two consecutive matches or “pseudo-matches” between tokens in the original token sequence and tokens in the converted token sequence.⁵ Once a forward anchor is found, tokens between the current position and the forward anchor are aligned according to a finite number of patterns, some of which are aware of the discrepancy in how some prepositions and determiners are agglutinated in the original tokenization scheme (e.g. des vs. de les).
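The synchronization strategy described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the weak match is reduced to its main pattern with exact matches only (pseudo-matches are omitted), the lookahead `window` is a hypothetical parameter, and the tokens between the current position and the forward anchor are simply grouped into one block instead of being aligned by dedicated patterns.

```python
def weak_match(orig, i, conv, j):
    """Main weak-match pattern: the next two tokens agree on both sides.

    Near the end of a sentence the compared slices are shorter, so two
    exhausted sequences (or two identical last tokens) also count.
    """
    return orig[i:i + 2] == conv[j:j + 2]

def find_forward_anchor(orig, i, conv, j, window=6):
    """Return the nearest position pair after (i, j) that weakly matches."""
    for total in range(1, 2 * window + 1):
        for di in range(max(0, total - window), min(total, window) + 1):
            dj = total - di
            i2, j2 = i + di, j + dj
            if i2 <= len(orig) and j2 <= len(conv) and weak_match(orig, i2, conv, j2):
                return i2, j2
    return None

def align(orig, conv, window=6):
    """Align two tokenizations of one sentence, left to right.

    Returns a list of (original_tokens, converted_tokens) blocks.
    """
    pairs, i, j = [], 0, 0
    while i < len(orig) or j < len(conv):
        if i < len(orig) and j < len(conv) and orig[i] == conv[j]:
            pairs.append(([orig[i]], [conv[j]]))
            i, j = i + 1, j + 1
            continue
        anchor = find_forward_anchor(orig, i, conv, j, window)
        if anchor is None:
            # No anchor in range: group everything that remains.
            anchor = (len(orig), len(conv))
        i2, j2 = anchor
        pairs.append((orig[i:i2], conv[j:j2]))
        i, j = i2, j2
    return pairs

if __name__ == "__main__":
    print(align(["il", "va", "au", "marché"], ["il", "va", "à", "le", "marché"]))
```

Requiring two consecutive agreeing tokens before resynchronizing makes it less likely that a single frequent word (such as a determiner) is mistaken for the anchor.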

Next, for each converted token that is aligned with an original token, the original token's gold syntactic information is extracted from the original SPMRL annotations (gold_SPMRL_head, gold_SPMRL_fpos, gold_SPMRL_delta, gold_SPMRL_label) and associated with the converted token in the form of additional features, appended for convenience to the relevant field. These features respectively provide information about the head, the fine-grained POS, the distance from the governor, and the label of the dependency linking the current word to its governor.
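As an illustration of this feature injection, the sketch below builds the four features for one token and appends them to a field as `name=value` pairs. Several details are assumptions made for the example and are not stated in the text: the head feature is taken to be the governor's word form, the delta is taken to be the signed offset from the token to its governor, and the `|`-separated field layout and dictionary keys are hypothetical.

```python
def gold_spmrl_features(sentence, index):
    """Build the four SPMRL-based features for sentence[index].

    `sentence` is a list of dicts with keys form, fpos, head, label,
    where head is a 1-based token position and 0 marks the root.
    """
    tok = sentence[index]
    head = tok["head"]
    # Assumption: the head feature is the governor's word form.
    head_form = "ROOT" if head == 0 else sentence[head - 1]["form"]
    # Assumption: delta is the signed offset from token to governor.
    delta = 0 if head == 0 else head - (index + 1)
    return {
        "gold_SPMRL_head": head_form,
        "gold_SPMRL_fpos": tok["fpos"],
        "gold_SPMRL_delta": delta,
        "gold_SPMRL_label": tok["label"],
    }

def append_features(field, features):
    """Append name=value pairs to an existing '|'-separated field."""
    extra = "|".join(f"{name}={value}" for name, value in features.items())
    return extra if field in ("", "_") else f"{field}|{extra}"

if __name__ == "__main__":
    sent = [
        {"form": "Paul", "fpos": "NPP", "head": 2, "label": "suj"},
        {"form": "dort", "fpos": "V", "head": 0, "label": "root"},
    ]
    print(append_features("_", gold_spmrl_features(sent, 0)))
```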

3.4.  Parsing-based  Treebank  Correction 

Inspired by what had been tried when stacking a symbolic parser with DyALog-SR (Villemonte de la Clergerie, 2014b), guiding features pseudogold_UD_label and pseudogold_UD_delta were added based on the result of the preliminary automatic conversion. They refer, in this preliminary UD version, to the label and the (ordered) distance to the governor (if any), respectively. Obviously, with such features, which are not gold because of conversion errors but are nevertheless quite accurate, training looks like a rather trivial task. However, about 6% of these guiding features were deleted at random, so that the parser would learn how to correct a certain amount of errors based on information about nearby dependencies, words, POS, and of course the SPMRL-based gold features added as described in the previous section. It should also be noted that because all these features are only indicative, the parser may even learn not to follow them under some conditions, in other words, decide that some gold annotations may not actually be correct.

⁵ For instance, a token face ‘in front’ in the converted token sequence will be considered a pseudo-match with a token face_à ‘in front of’ in the original token sequence. This pseudo-match will result in an offset of 1 on the converted side, in order to skip the probable token à that follows the converted token face. A weak match will therefore be found if the converted token following this à is a match or pseudo-match with the token following face_à in the original token sequence.

Clearly, that kind of scheme (introducing a small amount of error) can be used not only to correct errors when converting to a new annotation scheme (as tried here) but also to track and correct errors in gold annotations.
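The random deletion of guiding features can be sketched as below. This is a minimal illustration, assuming tokens are represented as dictionaries; the function name, the `rate` and `seed` parameters and the choice to drop both guiding features of a selected token together are assumptions, with the 6% default taken from the proportion reported in the text.

```python
import random

GUIDING_FEATURES = ("pseudogold_UD_label", "pseudogold_UD_delta")

def drop_guiding_features(tokens, rate=0.06, seed=0):
    """Return copies of the tokens with guiding features randomly removed.

    Each token loses its guiding features with probability `rate`, so the
    parser sees training examples where it must predict the dependency
    from the remaining context. The input tokens are left unchanged.
    """
    rng = random.Random(seed)
    noisy = []
    for tok in tokens:
        tok = dict(tok)  # shallow copy so the caller's data is untouched
        if rng.random() < rate:
            for name in GUIDING_FEATURES:
                tok.pop(name, None)
        noisy.append(tok)
    return noisy

if __name__ == "__main__":
    data = [{"form": "w", "pseudogold_UD_label": "obj", "pseudogold_UD_delta": -1}] * 1000
    noisy = drop_guiding_features(data)
    print(sum("pseudogold_UD_label" not in t for t in noisy), "tokens lost their guiding features")
```

Fixing the seed keeps the noisy training set reproducible across runs.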

Initially developed for participation in the SPMRL shared task, the parser we used, DyALog-SR, is a shift-reduce dependency parser that uses dynamic programming and beams to explore its search space and a feature-rich perceptron to weight the parser actions (Villemonte De La Clergerie, 2013). Early and aggressive updates of the perceptron are performed at training time. In particular, following ideas from dynamic oracles (Goldberg and Nivre, 2012), updates may occur for actions that clearly result in violations of the gold tree, for instance when adding a bad dependency. Using such a setting, our model achieved a high level of performance on the SEQUOIA gold data (the 10% not used in the training data, parsed with the same configuration as the data we aimed to correct), with 98.50% LAS. The same range of accuracy was achieved on the dev and test sections of the Ftb that contained no conversion errors (98.48% and 98.64% LAS, respectively).

4.  Evaluation 

Treebank conversion is a laborious task full of minutiae, and many conversion efforts improve their conversion iteratively, or as new relevant conversion needs are identified. A full manual evaluation of a converted treebank could represent an effort comparable to full re-annotation of a large part of the data. Indeed, few of the UD-conversion papers provide accuracy scores of the conversion on a manually annotated testbench.

For instance, the Danish conversion of Johannsen et al. (2015) uses a small set of hand-annotated sentences that reflect specific phenomena and hard cases, which serves as a held-out section during the iterative development of conversion rules. The Hungarian conversion of Vincze et al. (2017) uses a hand-corrected gold standard of 1,800 sentences. When comparing the quality of the conversion with the gold standard, they consider the accuracy (87.81 UAS and 75.99 LAS) not sufficient to release the resulting treebank.

We draw inspiration from their method to develop a hand-corrected sample to evaluate the quality of our conversion. One of the authors of the article, an expert in dependency annotation who is very familiar with the UD formalism, manually reviewed 100 sentences from the test section and 100 sentences from the dev section, correcting edges and labels that were either not properly attached or not compliant with UD2.

Table 1 shows the scores for the manual validation. The Unlabeled Attachment Score (UAS) is very high, as the annotator did not disagree with most of the edges resulting from the conversion. However, the scores drop noticeably when the labels are taken into account.

If we examine the label corrections made by the expert annotator, we find that most of them concern the label fixed, which has been used conservatively for all associated multiword expressions. Out of 360 relabelings overall, 274 are relabelings for edges converted into fixed that should instead be compound or flat:name.

Section   UAS     LAS
Dev       99.44   93.27
Test      99.40   94.75

Table 1: Manual evaluation scores for 100-sentence excerpts from the dev and test sections.
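For reference, the two attachment scores can be computed as in the minimal sketch below, which uses the standard definitions: UAS is the percentage of tokens whose head is correct, and LAS the percentage whose head and label are both correct. The function name and the input representation (one `(head, label)` pair per token) are assumptions made for this example, not the evaluation script used for Table 1.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) as percentages.

    `gold` and `pred` are equal-length lists of (head, label) pairs,
    one pair per token, in the same token order.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    total = len(gold)
    # UAS: only the head index has to agree.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: both the head index and the dependency label have to agree.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * uas_hits / total, 100.0 * las_hits / total

if __name__ == "__main__":
    gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "flat:name")]
    pred = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "fixed")]
    print(attachment_scores(gold, pred))
```

The toy example mirrors the pattern observed in the manual validation: a relabeling such as fixed to flat:name lowers LAS while leaving UAS untouched.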

While some of the corrections for the fixed relation can be automated depending on the syntactic role of the overall multiword subtree (e.g. a subtree that functions as case is a multiword adposition and should be labeled fixed, while a subtree labeled nsubj would be a proper name or a compound), the distinction between these three types of relations, which are not exactly dependency relations in nature but must be described as such by virtue of the UD formalism, requires per-item linguistic analysis.
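The partial automation mentioned above could take the form of a simple triage step, sketched here under stated assumptions. Only the two cases named in the text (case and nsubj) are encoded; the function name and the returned strings are hypothetical, and anything else is deferred to manual review, since the choice between compound and flat:name requires per-item analysis.

```python
def triage_fixed_subtree(subtree_deprel):
    """Suggest an action for a multiword subtree currently labeled 'fixed'.

    `subtree_deprel` is the relation that the subtree as a whole bears
    to the rest of the sentence.
    """
    if subtree_deprel == "case":
        # A multiword adposition: the fixed label is appropriate.
        return "keep fixed"
    if subtree_deprel == "nsubj":
        # Likely a proper name or a compound; a human must decide which.
        return "review: compound or flat:name"
    return "review"

if __name__ == "__main__":
    for rel in ("case", "nsubj", "obl"):
        print(rel, "->", triage_fixed_subtree(rel))
```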

We have not observed any mis-conversion of the core nominal arguments of verbs in this sample, which means that subjects and objects, as well as the root node, are properly annotated there. In general, misattachments happen at lower points of the dependency tree, closer to the leaves, and are thus less relevant for overall dependency quality (Plank et al., 2015).

After multiword expressions, there are roughly thirty cases where the expert determined that the preferred relation should have been either appos (apposition) or parataxis. These labels are already controversial and are not easy to annotate. This indicates, however, that the quality of the treebank is high enough for the most frequent expert relabelings to fall within the domain of fine distinctions between syntactic-semantic relations. Indeed, only one sentence out of the pooled 200 contained errors caused by coordination embedding, where the tree had to be corrected so that the inner coordinates would not attach outside the scope of their closest subsuming coordination.

5.  Conclusion 

We have described our effort to provide a highly reliable conversion of the FTB into UD 2.0 based on a convert-then-reparse principle. This method provides very high unlabeled accuracy (99.42 on average over the 200 manually reviewed sentences). However, the resulting treebank will need to be kept up to date with advances in the UD formalism, including a more homogeneous treatment of parataxis and appositions, as well as a detailed per-item analysis of multiword expressions and their potential relabeling. This method will be applied to the French Question Bank (Seddah and Candito, 2016) and to other data sets for English.